REVIEW OF THE EFFECTS OF TEST-BASED RETENTION
ON STUDENT OUTCOMES OVER TIME: REGRESSION 
DISCONTINUITY EVIDENCE FROM FLORIDA 


Joseph P. Robinson-Cimpian, University of Illinois Urbana-Champaign 


I. Introduction 


A number of jurisdictions have laws requiring the use of standardized achievement
scores in promotion/retention decisions. In general, these laws and policies set passing
scores, or cut-off scores, that students must attain to be automatically considered eligible
for promotion to the next grade level. In this working paper, the authors use longitudinal
data on students in grades 3 to 12 in all public schools in Florida from school years 2003-4
to 2012-13 to study the short- and long-term effects of third-grade retention on student outcomes.


The report uses regression discontinuity methods, which compare students who score very
close together but fall on opposite sides of the cut-off score. This comparison can approximate
a true experimental design in a real-world setting. The technique is powerful and is
aimed at making causal claims. Studies that use such robust methods could influence
early-grade test-based promotion/retention policies across the country, not just in Florida.


II. Findings and Conclusions of the Report


Contrary to the conventional wisdom on grade retention, the report finds that students in
Florida who are retained in third grade because they just barely failed to attain the state’s
threshold performed better than students of the same age on the next year’s tests of math
and reading. The study does, however, find that these purported benefits fade over time.


III. The Report’s Rationale for Its Findings and Conclusions


The report illustrates how less sophisticated methods (e.g., a standard regression approach)
can lead to underestimates of the retention effects because they do not account for factors
unobserved by researchers. Typically, random assignment to experimental and control groups
resolves this problem, but random assignment is not always possible on a large scale. Thus,
comparing students who barely passed with those who barely failed, and who did not differ
from each other on key variables, allows comparisons that would not otherwise be possible.
However, this technique requires plausibly satisfying several assumptions. These issues are
discussed in Section V below.


IV. The Report’s Use of Research Literature


The report cites much of the most directly comparable and appropriate literature. As 
the authors note in discussing their literature and their new study, their findings are at 
odds with the conventional wisdom represented in the seminal meta-analysis by Thomas
Holmes, who found that retained students performed worse on subsequent tests and
were more likely to drop out. One very important distinction between the current report 
and the individual studies referenced in the meta-analysis is the methodology. The studies 
in Holmes’s analysis compared students across a range of prior achievement levels, and 
some of the works statistically adjusted for achievement (and other) differences between 
retained and promoted students. The current report, however, studies the effects of reten- 
tion at a very narrow window of prior achievement. This is one reason why it may not be 
wholly accurate to compare the findings of prior research with those of the current report. 


The results of this report are, however, consistent with the limited number of recent
quasi-experimental studies focused on third-grade retention. These studies include a
Chicago study (using an instrumental variables approach) by Jacob and Lefgren, cited in
the report, which found no effect on graduation, and a more recent Texas study (using
propensity score matching) by Lorence (not cited in the report), which found that retained
students performed better than promoted students on same-grade tests.


V. Review of the Report’s Methods 


A strong method for making causal inferences 


The report relies on what is known as a regression discontinuity design (RDD), a tech- 
nique used for making causal inferences from non-experimental data when a thresh- 
old determines or strongly predicts treatment assignment. In the case here, all students 
have a score on the third-grade Florida Comprehensive Achievement Test (FCAT), but 
only those with scores below the state-specified threshold will be flagged for possible
retention according to the policy. Thus, the policy creates a distinction between two sets of
students (those who just barely failed to attain the threshold and those who just barely
attained it) whom we can otherwise consider to be virtually identical in all respects,
except that one set just barely failed to attain the threshold and might now be retained
as a result. In effect, the RDD compares these two sets of otherwise identical students
on later outcomes, such as next-year reading achievement, to estimate the effect of
threshold-induced retention flagging in third grade on that outcome.
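
The comparison described above can be sketched on simulated data. The following is a minimal illustration, not the report's actual model: the data-generating numbers (a baseline of 1500, a slope of 100, noise with a standard deviation of 30, and a simulated jump of 24 points), the bandwidth of 0.5, and the function names are all assumptions chosen for illustration. It fits a separate straight line on each side of the cutoff within a narrow window and compares the two fitted values at the cutoff.

```python
import random


def intercept_at_zero(xs, ys):
    """Least-squares fit of y = a + b*x; returns a, the fitted value at x = 0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x


def rdd_jump(scores, outcomes, cutoff, bandwidth):
    """Outcome gap at the cutoff: fitted value just below minus fitted value just above."""
    below_x, below_y, above_x, above_y = [], [], [], []
    for s, y in zip(scores, outcomes):
        centered = s - cutoff  # distance from the cutoff
        if -bandwidth <= centered < 0:
            below_x.append(centered)
            below_y.append(y)
        elif 0 <= centered <= bandwidth:
            above_x.append(centered)
            above_y.append(y)
    return intercept_at_zero(below_x, below_y) - intercept_at_zero(above_x, above_y)


def simulate(n=20000, true_jump=24.0, seed=0):
    """Hypothetical data: the outcome rises linearly with the score, and students
    below the cutoff (score < 0) receive an extra true_jump points."""
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
    outcomes = [
        1500.0 + 100.0 * s + (true_jump if s < 0.0 else 0.0) + rng.gauss(0.0, 30.0)
        for s in scores
    ]
    return scores, outcomes


scores, outcomes = simulate()
estimate = rdd_jump(scores, outcomes, cutoff=0.0, bandwidth=0.5)
print(f"estimated jump at the cutoff: {estimate:.1f} (simulated true value: 24.0)")
```

Because only observations within the bandwidth enter the fit, the estimate speaks only to students near the cutoff, a point that matters for the generalizability concerns raised later in this review.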


As the report points out, however, other factors affect retention decisions (discussed in 
greater detail in the next subsection), and therefore, not all students who fail to attain 
the threshold are retained. In fact, only about one-third of students who just barely failed 
to meet the threshold are retained in grade, compared to about 5% of students who just
barely met it. Thus, the authors used a technique known as instrumental variable (IV) es- 
timation to scale the RDD-based estimate of retention eligibility by the proportion of stu- 
dents actually retained as a direct result of the threshold. The IV approach is often used 
if attaining the threshold predicts but does not determine treatment (in this case, reten- 
tion). For instance, if only half of the people who were assigned to the treatment received 
it, then we would want to adjust the difference in outcomes by the proportion affected, 
because only half of the people who we intended the treatment for actually received it. 


In this specific case, the difference in next-year reading scores is about 24 scale score
points, or .065 standard deviations (SDs). But because failing to meet the threshold
increases the likelihood of retention by only 28 percentage points, the IV approach divides
the 24-point difference in outcomes by 0.28, the proportion of students whose retention
was actually induced by falling below the threshold. The logic behind this IV adjustment
is that the original 24-point difference is assumed to be the average of a larger effect on
28% of the students and a zero effect on the remaining 72% (whose retention status did
not hinge on whether they attained the threshold). In this way the 24-point RDD estimate
is divided by the 28% of affected students to obtain the retention effect of 84 scale score
points, or .226 SDs, on the next year’s reading test for students whose retention status
was affected by the threshold. This approach of combining RDD and IV methods is common;
however, note that the required IV assumption of a zero effect is likely not satisfied,
as I discuss in the next subsection, calling the validity of the results into question.
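
The scaling step can be reproduced with simple arithmetic. The sketch below uses only the rounded figures quoted in this review (24 points, .065 SDs, and 28 percentage points); the function name is an illustrative assumption. With these rounded inputs the ratio comes out slightly above the 84 points and .226 SDs cited above, a gap that presumably reflects rounding of the quoted inputs (an assumption on my part, since the unrounded values are not reproduced here).

```python
def wald_estimate(reduced_form, first_stage):
    """Scale the outcome jump at the threshold by the jump in treatment probability."""
    if first_stage == 0:
        raise ValueError("first stage must be nonzero")
    return reduced_form / first_stage


# Rounded figures quoted in this review.
outcome_jump_points = 24.0  # RDD difference in next-year reading, scale score points
outcome_jump_sd = 0.065     # the same difference in standard deviations
retention_jump = 0.28       # increase in the probability of retention at the threshold

effect_points = wald_estimate(outcome_jump_points, retention_jump)
effect_sd = wald_estimate(outcome_jump_sd, retention_jump)

# The mixture logic: a larger effect on 28% of students and a zero effect on the
# other 72% averages back to the RDD jump.
implied_average = retention_jump * effect_points + (1 - retention_jump) * 0.0

print(f"{effect_points:.1f} points, {effect_sd:.3f} SDs")
print(f"implied average across all students near the threshold: {implied_average:.1f}")
```

The second printed line makes the key assumption visible: the division is only justified if the 72% of students whose retention status does not change at the threshold experience no effect of falling below it.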


The report provides a number of alternative models and checks to assess the stability
of its results. These include the standard RDD checks, such as ensuring that roughly
the same numbers of students fall just above and just below the threshold (see
Figure 5 in the report) and that the characteristics of students (e.g., gender, lunch sta- 
tus, race) are similar just above and just below the threshold (see Figure 6 in the report). 
The report goes further, though, and presents a compelling analysis demonstrating that 
the students affected by the policy (i.e., the compliers in the IV literature terminology) 
appear similar on demographic factors to students whose retention status is not affected 
by threshold attainment (see Table A-5 in the report). Also, the report takes advantage 
of having comparable data before the policy was implemented (i.e., before 2003) to ex- 
amine whether other changes at the threshold—specifically, a change in label from “lev- 
el 2 reader” to “level 1 reader”—could account for the effects. The report finds that this 
change in label has virtually no effect on outcomes, and thus the observed effects in the 
main analysis are likely attributable to policy interventions around early-grade retention.


Multiple interventions and intervention groups created by the policy: Serious 
threats to the report’s claims 


The report discusses its findings as referring to the “effects of retention,” as if the
policy creates two sets of students: (1) those who are retained and (2) those who are
promoted. However, the policy is much more complex, instead creating four sets of
students depending on whether they were retained or promoted and whether their scores
on the third-grade FCAT were above or below the policy threshold: (1) those who are
retained and receive additional services, including summer reading camp, (2) those who
are retained and receive no other services, (3) those who are promoted and receive
additional services, and (4) those who are promoted and do not receive additional
services. Table 1 in this review illustrates the services each set of students receives and
also provides the percentage of students near the threshold that fall into each category.


Table 1. Services provided to students according to the policy, based on whether the
student was retained/promoted and below/above the policy threshold


If student is retained...

  Just below threshold (Group 1): Paragraph (7)(b)(1) of the policy stipulates that, in
  addition to retention itself, students retained under this policy shall receive “intensive
  instructional services and supports to remediate the identified areas of reading deficiency,
  including participation in the school district’s summer reading camp as required under
  paragraph (a) and a minimum of 90 minutes of daily, uninterrupted, scientifically
  research-based reading instruction” and other services that may include ...
  About 33% of those just below threshold.

  Just above threshold (Group 2): The policy does not specify services for students retained
  who attained the threshold.
  About 5% of those just above threshold.

If student is promoted...

  Just below threshold (Group 3): Paragraph (6)(b) of the policy stipulates that students
  promoted with a “good cause exemption” shall receive “intensive reading instruction and
  intervention that include specialized diagnostic information and specific reading
  strategies to meet the needs of each student so promoted.”
  About 67% of those just below threshold.

  Just above threshold (Group 4): No specified additional services for this group.
  About 95% of those just above threshold.


As shown in Table 1, according to the policy text, students who are retained because their scores 
fell below the policy threshold (i.e., Group 1 in Table 1) are not only retained, but also receive 
“intensive instructional services” including a summer reading camp and at least 90 minutes 
of daily, uninterrupted reading instruction. By contrast, for the 5% of students just above the 
threshold who are retained (i.e., Group 2), the policy does not require that the district provide 
them with any specific services. If the students in Group 2 received the same summer
reading camp and intensive instruction as students in Group 1, then any conclusions drawn
from comparing these two groups would refer to the “effects of retention plus summer
reading camp and intensive instructional services” rather than simply the “effects of
retention.” If, however, the services for Groups 1 and 2 truly are different, then the
interpretation is not straightforward. Note, however, that the report cannot directly
compare these two groups.


In the lower portion of Table 1, we see that the policy also creates two sets of promoted stu- 
dents. The Florida policy permits students falling short of the test-based threshold to be pro- 
moted provided they have a “good cause exemption,” which includes seven categories, such 
as being an English learner with fewer than two years of ESL instruction or having an IEP 
that invalidates the FCAT score. According to the policy text, falling just shy of the test-based 
threshold and being promoted indicates that an exemption was granted and that “intensive 
reading instruction and intervention” are provided to these students. In these cases (roughly 
two-thirds of students just below the threshold; Group 3 in Table 1), these services (but not 
retention, since they were promoted) can have a positive effect on subsequent achievement. 


Importantly, the use of the IV in this case is inappropriate because it violates the
assumption that the threshold has no direct effect on outcomes other than through retention:
the threshold can have a direct effect through the intervention services for promoted
students with “good cause exemptions” (the vast majority of cases; 67%). That is, for the
report’s claim about the causal effect of third-grade retention to be valid, there can be no
other services that change at the policy threshold. But paragraph (6)(b) of the policy
stipulates that other services change at the threshold: specifically, among all students
who are promoted, those who fell short of the cut-off are provided with an intervention that
is not specifically provided to those above the threshold. (This concern is in addition to the
one raised above about how the policy is not simply a retention policy, but rather a
retention-plus-other-services policy.)


The degree to which this threat is serious or even fatal to the report’s analysis depends 
on whether the policy is actually being carried out. While the policy states a mandate (the 
district “shall” provide the additional assistance), the report includes no evidence, one 
way or the other, about whether this provision is being enforced and whether the assis- 
tance is actually being provided to help these students. If no assistance is actually be- 
ing provided, then the report’s analyses are not really threatened by the “good cause ex- 
emption” issue. If, however, the assistance is substantial, then this issue invalidates the 
use of the IV and renders the main conclusions of the report correspondingly invalid. 


To summarize this subsection, the policy does more than require retention for students be- 
low the threshold. Rather, it requires that those retained students receive additional services, 
and thus, the effects of the report are—at a minimum—a combination of retention effects and 
effects of the other interventions for retained students (e.g., summer reading camp). In ad- 
dition to this complication, the policy allows students below the threshold to be promoted, 
but it requires that these students receive intensive reading instruction and interventions 
intended to raise their achievement. This additional requirement in the policy text implies 
that the RDD+IV assumptions are likely not satisfied, and the main estimates of the report
would thus be uninterpretable. Instead, we could perhaps interpret the RDD-based estimates
(which are less than 1/3 the size of the RDD+IV estimates) as reflecting a weighted
average of the effect of retention, summer reading camp, intensive instruction, and
additional services for about 28% of students and the effect of the intensive reading instruction and
intervention for about 67% of students. In general, though, the report cannot disentangle the 
effects of retention from those of the summer reading camp, the intensive reading instruction, 
or any of the other services allowed by the policy (e.g., reduced teacher-student ratios; men- 
toring or tutoring; more frequent progress monitoring; extended school day, week, or year). 
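
To see how much a direct effect of the threshold could distort the scaled estimate, consider a simple arithmetic sketch. The two shares (28% and 67%) are the approximate figures from this review; the two effect sizes (40 points for retention plus services, 15 points for the intervention given to promoted students) are purely hypothetical values chosen for illustration, as are the function names. The sketch also assumes, as a simplification, that every exempted student receives the intervention and that no student above the threshold does.

```python
def reduced_form(p_compliers, tau_retention, p_exempt, tau_services):
    """Outcome jump at the threshold as a weighted sum of the two effects."""
    return p_compliers * tau_retention + p_exempt * tau_services


def naive_iv(p_compliers, tau_retention, p_exempt, tau_services):
    """Jump divided by the share retained because of the threshold, which
    attributes the entire jump to retention."""
    jump = reduced_form(p_compliers, tau_retention, p_exempt, tau_services)
    return jump / p_compliers


P_COMPLIERS = 0.28    # share retained because they fell below the threshold
P_EXEMPT = 0.67       # share below the threshold promoted with an exemption
TAU_RETENTION = 40.0  # hypothetical effect of retention plus services, in points
TAU_SERVICES = 15.0   # hypothetical effect of the intervention for promoted students

jump = reduced_form(P_COMPLIERS, TAU_RETENTION, P_EXEMPT, TAU_SERVICES)
iv = naive_iv(P_COMPLIERS, TAU_RETENTION, P_EXEMPT, TAU_SERVICES)
bias = iv - TAU_RETENTION

print(f"jump at threshold: {jump:.2f}; scaled estimate: {iv:.2f}; overstatement: {bias:.2f}")
```

Under these assumptions the overstatement equals (0.67 / 0.28) times the intervention effect, so even a modest effect of the services for promoted students is magnified more than twofold in the scaled estimate. If the intervention effect is set to zero, the scaled estimate recovers the hypothetical retention effect exactly.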


Limits to generalizability 


Although the RDD method has tremendous potential for allowing researchers to make 
causal inferences in the absence of an experiment, the population to which those infer- 
ences can be easily generalized is extremely limited. That is, if we set aside for the mo- 
ment the “good cause exemption” threat to validity, this study may hold lessons about 
the effects of this retention policy (plus the other services retained students receive) on 
students at and very near the policy threshold, but we do not know if those effects would 
be obtained further away from the threshold, say, to students who score half a standard 
deviation below the threshold. Likewise, we do not know the extent to which students 
half a standard deviation above the threshold would benefit from this retention policy. 


The IV approach (if it is valid) has its own limitations to generalizability, which further 
restrict the population to which inferences can be made. Namely, we cannot generalize 
findings to students that teachers (or parents) would definitely want to retain or definitely 
not want to retain. The former group is small, but the latter seems to be the largest group 
in the data used for the report, constituting roughly two-thirds of students in the data.


Combining the RDD and IV restrictions to generalizability, the results of the report can 
only reasonably be generalized to students in Florida retained in third grade who are 
very near the retention/promotion cusp and whose status hangs on which side of the 
cusp their third-grade reading test score falls. Looking at Figure 5 of the report, indicat- 
ing where the threshold lies in relation to the density of observations (i.e., about 1 SD 
below the mean value), the proportion of the full data at the threshold is approximate- 
ly 1%, and of that 1%, only 28% are directly affected by the policy. A quarter of one per- 
cent is not a very large segment of the population. Therefore, if the application of RD- 
D+IV produced valid estimated effects for retention alone (which is questionable given 
the “good cause exemption” interventions), we would still be remiss to think that the es- 
timate for this very small subpopulation should generalize to broad retention policies. 
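
The share of the population to which the estimates apply follows from multiplying the two approximate proportions given above. The variable names are illustrative, and the 1% figure is my approximate reading of Figure 5 in the report rather than an exact count.

```python
share_near_threshold = 0.01  # approximate share of all students at the threshold
share_affected = 0.28        # share of those whose retention hinges on the threshold

share_identified = share_near_threshold * share_affected
print(f"{share_identified:.2%} of the full data")
```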


Short- vs. long-term outcomes: Some complications 


Focusing on the interpretation of the effect estimates, several additional long-term
considerations come into play. Regarding the short-term outcomes (e.g., test scores one
year after the third-grade retention decision), the report’s use of the RDD and IV methods
may estimate the causal effect of retention on the subpopulation discussed above
(setting aside the “good cause exemption” concerns already noted); however, as the report
looks to longer-term outcomes (e.g., test scores six years after the third-grade retention
decision, graduation), the interpretation of the estimate becomes less clear.


The usual suspect for concern regarding longer-term outcomes in longitudinal stud- 
ies—differential attrition—is not so much of a concern here. The report does a nice 
job of assessing the potential for differential attrition based on observable fac- 
tors. The result of that assessment is that differential attrition is likely to be very 
small at most and unlikely to account for much (if any) of the effects fading over time. 


The bigger concern is more intriguing: being retained in third grade lowers the likeli- 
hood of later-grade retention. In other words, third-grade retention itself reduces a stu- 
dent’s chances of being retained in fourth, fifth, sixth, and seventh grades (see Table 6). 
Thus, the effect estimate two years after the third-grade retention decision is not simply 
answering the question “What is the effect of third grade retention on student achieve- 
ment when students should be in grade 5?”, but rather is answering “What is the effect 
of third grade retention (and other services) and a lower chance of fourth grade reten- 
tion on student achievement when students should be in grade 5?” Note that the report 
acknowledges this limitation somewhat and provides an estimate of the corrected effect, 
after accounting for the differential in the proportion of students retained the next year. 
Bringing this limitation to the fore is commendable; however, the method for correct- 
ing the estimate is more assumption-laden and less rigorous than the original estimate. 


More specifically, to correct for this combined effect, the report assumes that retention in
fourth grade has a similar effect as retention in third grade. Without any evidence in the
context of Florida’s policy to support this assumption, its validity is uncertain at best and,
based on prior studies by Jacob and Lefgren as well as Manacorda, is likely to be incorrect.
That is, it appears that later-grade retention often leads to more negative outcomes
for graduation. By not properly accounting for these more negative effects, the report may
understate the long-term benefits of early-grade retention (and, more accurately, also of
additional services for retained students, as well as the intensive reading intervention for
students who are promoted due to the “good cause exemptions”). Using their adjustment
approach, the authors conclude that differential retention in fourth grade accounts for up
to 33% of the decay in the effect on reading and up to 21% in math. First, however, it may
be inaccurate to claim that this accounts for “no more than X%” because we do not know if the
assumptions are valid, and they appear to contradict both the literature and the report’s
own earlier assertions, which suggest that later-grade retention effects are less positive.
Second, it is likely inappropriate to apply the third-grade effect estimate to later years
because, as stated above, the policy indicates that retention in third grade includes other
services, which may not be provided to students retained in later grades. Third, later-grade
retention can have a compounding effect (i.e., students can be retained in several
grades, and each retention can have lasting effects), but the report does not address this issue.


Another important point is that although the report uses data pooled across six cohorts to
estimate the effect one year after the retention decision, the analyses six years out only
include data from the oldest cohort. This means that the population to which inferences apply
changes from year to year across the analyses, the sample size shrinks, and the estimates
become noisier.
These limitations grow more substantial as the report estimates longer-term outcomes, and 
importantly, are in addition to the above concerns regarding the long-term outcome analyses. 
Finally, the analyses examining retention effects on graduation are also only based on the first 
cohort affected by the policy, not the six cohorts used for the short-term outcome analysis. 


VI. Review of the Validity of the Findings and Conclusions 


The report purports to answer the question, “What is the effect of third grade retention 
on student achievement [when students making normal progress should be] in grade g 
[e.g., 4, 5, etc.]?” (p. 8). The report uses a strong design to estimate the effect of third- 
grade retention (and other services, such as a summer reading camp) in Florida based on a 
state-level policy requiring passage of a test-based threshold. That said, the claimed
causal effects are questionable because of services provided to students who are promoted
despite failing to attain the threshold. Beyond this substantial concern, there are other
limitations: the population to which the effects can easily be generalized is well less than
1% of the population represented by the data, and the interpretation of the report’s
estimates becomes murkier as one moves further away from the time of the retention decision.


VII. Usefulness of the Report for Guidance of Policy and Practice


Because the Florida retention policy stipulates that services must be provided to students 
promoted with a “good cause exemption,” the effects of third-grade retention are confound- 
ed with the effects of the intensive intervention. This calls into question both the estimated 
effects and the utility of this report. Moreover, even if the concern of confounding effects can 
be overcome, there are other factors that limit the utility of the report. First, the estimated 
effects do not simply reflect retention alone, but rather retention plus other services. Sec- 
ond, the generalizability of the effects is very limited, applying only to Florida students at or 
very near the statewide third-grade test threshold for retention and simultaneously directly 
affected by the policy. Third, differences in later grade retention between those retained in 
third grade and those not retained, combined with a sample that begins with six cohorts and 
decreases to a single one, render the longer-term outcome analysis less compelling. Thus, 
the primary strength of this report may lie in its analysis of short-term effects, but even these 
effects must be interpreted as valid for only the students very near to Florida’s test-based 
threshold in third grade and are very likely not valid due to confounding intervention effects. 